We thank all the reviewers for their time and valuable comments. Reviewer #1 wants to see an algorithm that works when b has negative components: "Provide an algorithm to output a distribution that's close to the target, even if b has negative components." This is an interesting direction for future research, and we will mention it in the paper. "What happens when we increase the number of layers?"
Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders
Chanin, David, Dulka, Tomáš, Garriga-Alonso, Adrià
It is assumed that sparse autoencoders (SAEs) decompose polysemantic activations into interpretable linear directions, as long as the activations are composed of sparse linear combinations of underlying features. However, we find that if an SAE is narrower than the number of underlying "true features" on which it is trained, and there is correlation between features, the SAE will merge components of correlated features together, destroying monosemanticity. In LLM SAEs, these two conditions are almost certainly true. This phenomenon, which we call feature hedging, is caused by the SAE reconstruction loss, and is more severe the narrower the SAE. In this work, we introduce the problem of feature hedging and study it both theoretically in toy models and empirically in SAEs trained on LLMs. We suspect that feature hedging may be one of the core reasons that SAEs consistently underperform supervised baselines. Finally, we use our understanding of feature hedging to propose an improved variant of matryoshka SAEs. Importantly, our work shows that SAE width is not a neutral hyperparameter: narrower SAEs suffer more from hedging than wider SAEs.

As large language models (LLMs) are deployed in real-world applications, it is increasingly important to understand their internal workings. SAEs have the advantage of operating completely unsupervised, and can easily be scaled to millions of neurons in their hidden layer (hereafter called "latents"). While SAEs have shown promising results, recent work has cast doubt on their performance relative to baseline techniques. Wu et al. (2025) show that SAEs underperform baselines on both concept steering and detection, and Kantamneni et al. (2025) show that SAEs underperform simple linear probes on both in-domain and out-of-domain detection, even when the probes have very few training samples. The question, then, is why do SAEs underperform relative to other techniques? And if we can identify the problems holding back SAEs, can we then fix those problems?
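To make the mechanism concrete, here is a minimal toy sketch (our own illustration; the feature probabilities and hyperparameters are assumptions, not the paper's experimental setup): two correlated "true features" in a 2-dimensional activation space and a single-latent SAE trained only on reconstruction plus an L1 sparsity penalty. When the features co-occur often enough, the lone decoder direction absorbs a component of both features rather than aligning cleanly with one of them.

```python
# Toy sketch of feature hedging (illustrative assumptions throughout).
import torch

torch.manual_seed(0)
d_model, n_latents, n_steps = 2, 1, 3000
f1 = torch.tensor([1.0, 0.0])   # "true feature" 1
f2 = torch.tensor([0.0, 1.0])   # "true feature" 2, correlated with feature 1

def sample_batch(n=512, p1=0.5, p2_given_1=0.9, p2_given_not1=0.05):
    a1 = (torch.rand(n) < p1).float()
    p2 = torch.where(a1.bool(), torch.tensor(p2_given_1), torch.tensor(p2_given_not1))
    a2 = (torch.rand(n) < p2).float()
    return a1[:, None] * f1 + a2[:, None] * f2

W_enc = torch.randn(d_model, n_latents, requires_grad=True)
W_dec = torch.randn(n_latents, d_model, requires_grad=True)
b_enc = torch.zeros(n_latents, requires_grad=True)
opt = torch.optim.Adam([W_enc, W_dec, b_enc], lr=1e-2)

for _ in range(n_steps):
    x = sample_batch()
    acts = torch.relu(x @ W_enc + b_enc)                         # encoder
    x_hat = acts @ W_dec                                         # decoder
    loss = ((x - x_hat) ** 2).mean() + 1e-3 * acts.abs().mean()  # reconstruction + L1 sparsity
    opt.zero_grad(); loss.backward(); opt.step()

# With hedging, the single decoder direction mixes both true features
# instead of aligning with just one of them.
print(W_dec / W_dec.norm(dim=-1, keepdim=True))
```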
Reviews: Learning Distributions Generated by One-Layer ReLU Networks
A popular generative model these days is as follows: pass standard Gaussian noise through a neural network. But a major unanswered question is: what is the structure of the resulting distribution? Given samples from such a distribution, can we learn the distribution parameters? This question is the topic of this paper. Specifically, consider a one-layer ReLU neural network, specified by a weight matrix W and a bias vector b.
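As a point of reference, here is a minimal sketch of the generative model under discussion (the dimensions and seed are arbitrary illustrative choices): sample standard Gaussian noise z and push it through one ReLU layer, x = ReLU(Wz + b). The learning problem is to recover W and b from samples of x alone.

```python
# Minimal sketch of the one-layer ReLU generative model (illustrative dimensions).
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_samples = 4, 6, 10_000

W = rng.standard_normal((d_out, d_in))        # weight matrix
b = rng.standard_normal(d_out)                # bias vector

z = rng.standard_normal((n_samples, d_in))    # z ~ N(0, I)
x = np.maximum(z @ W.T + b, 0.0)              # x = ReLU(W z + b)

# The learning question: given only samples x, recover (W, b).
print(x.shape, float((x == 0).mean()))        # fraction of coordinates clipped by the ReLU
```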
Unveiling Concept Attribution in Diffusion Models
Nguyen, Quang H., Phan, Hoang, Doan, Khoa D.
Diffusion models have shown remarkable abilities in generating realistic and high-quality images from text prompts. However, a trained model remains a black box; we know little about the role its components play in expressing a concept such as an object or a style. Recent works employ causal tracing to localize layers storing knowledge in generative models, without showing how those layers contribute to the target concept. In this work, we approach the model interpretability problem from a more general perspective and pose a question: \textit{``How do model components work jointly to demonstrate knowledge?''}. We adapt component attribution to decompose diffusion models, unveiling how each component contributes to a concept. Our framework enables effective model editing: in particular, we can erase a concept from diffusion models by removing its positive components while retaining knowledge of other concepts. Surprisingly, we also show that there exist components that contribute negatively to a concept, which knowledge-localization approaches have not uncovered. Experimental results confirm the role of the positive and negative components pinpointed by our framework, depicting a complete view of interpreting generative models. Our code is available at \url{https://github.com/mail-research/CAD-attribution4diffusion}
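A minimal sketch of the general component-attribution recipe the abstract describes (our own toy under assumed names and data, not the authors' implementation): ablate random subsets of components, record a concept score for each ablated model, and fit a linear surrogate whose coefficients estimate each component's positive or negative contribution to the concept.

```python
# Illustrative sketch of component attribution via random ablations + a linear surrogate.
import numpy as np

rng = np.random.default_rng(0)
n_components, n_trials = 50, 2000

# Hypothetical ground-truth contributions: a few positive, a couple of negative components.
true_attr = np.zeros(n_components)
true_attr[:3], true_attr[3:5] = 1.0, -0.5

def concept_score(mask):
    # Stand-in for "evaluate the diffusion model with the masked-out components removed".
    return float(mask @ true_attr + 0.05 * rng.standard_normal())

masks = (rng.random((n_trials, n_components)) > 0.2).astype(float)  # keep ~80% of components per trial
scores = np.array([concept_score(m) for m in masks])

design = np.c_[masks, np.ones(n_trials)]                            # add an intercept column
coef, *_ = np.linalg.lstsq(design, scores, rcond=None)
attr = coef[:-1]
print("most positive components:", np.argsort(attr)[-3:])
print("most negative components:", np.argsort(attr)[:2])
```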
Deriving Activation Functions via Integration
Activation functions play a crucial role in introducing non-linearities to deep neural networks. We propose a novel approach to designing activation functions by focusing on their gradients and deriving the corresponding functions through integration. Our work introduces the Expanded Integral of the Exponential Linear Unit (xIELU), a trainable piecewise activation function derived by integrating trainable affine transformations applied to the ELU activation function. xIELU combines two key gradient properties: a trainable and linearly increasing gradient for positive inputs, similar to ReLU$^2$, and a trainable negative gradient flow for negative inputs, akin to xSiLU. Conceptually, xIELU can be viewed as extending ReLU$^2$ to effectively handle negative inputs. In experiments with 1.1B-parameter Llama models trained on 126B tokens of FineWeb Edu, xIELU achieves lower perplexity than both ReLU$^2$ and SwiGLU when matched for compute cost and parameter count.
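The recipe can be illustrated with a small self-contained example; the gradient parameterization below is an assumption chosen for illustration and is not the paper's exact xIELU definition. We specify a trainable gradient (linearly increasing for positive inputs, ELU-shaped for negative inputs), integrate it in closed form with f(0) = 0, and check with autograd that the forward function's derivative matches the chosen gradient.

```python
# Sketch of "derive the activation by integrating its gradient" (illustrative parameterization).
import torch

class IntegratedActivation(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.a_p = torch.nn.Parameter(torch.tensor(0.5))  # slope of the positive-side gradient
        self.a_n = torch.nn.Parameter(torch.tensor(0.5))  # scale of the ELU-shaped negative-side gradient

    def gradient(self, x):
        # Chosen gradient: linearly increasing for x > 0, ELU-shaped for x <= 0.
        return torch.where(x > 0, self.a_p * x, self.a_n * (torch.exp(x) - 1))

    def forward(self, x):
        # Closed-form antiderivative of `gradient`, fixed so that f(0) = 0.
        pos = 0.5 * self.a_p * x ** 2
        neg = self.a_n * (torch.exp(x) - x - 1)
        return torch.where(x > 0, pos, neg)

act = IntegratedActivation()
x = torch.linspace(-3, 3, 7, requires_grad=True)
(autograd_grad,) = torch.autograd.grad(act(x).sum(), x)
print(torch.allclose(autograd_grad, act.gradient(x), atol=1e-6))  # True: the forward integrates the gradient
```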
Fundamental Limitations of Alignment in Large Language Models
Wolf, Yotam, Wies, Noam, Avnery, Oshri, Levine, Yoav, Shashua, Amnon
An important aspect of developing language models that interact with humans is aligning their behavior to be useful and unharmful for their human users. This is usually achieved by tuning the model in a way that enhances desired behaviors and inhibits undesired ones, a process referred to as alignment. In this paper, we propose a theoretical approach called Behavior Expectation Bounds (BEB) which allows us to formally investigate several inherent characteristics and limitations of alignment in large language models. Importantly, we prove that within the limits of this framework, for any behavior that has a finite probability of being exhibited by the model, there exist prompts that can trigger the model into outputting this behavior, with probability that increases with the length of the prompt. This implies that any alignment process that attenuates an undesired behavior but does not remove it altogether is not safe against adversarial prompting attacks. Furthermore, our framework hints at the mechanism by which leading alignment approaches such as reinforcement learning from human feedback make the LLM prone to being prompted into undesired behaviors. This theoretical result is demonstrated experimentally at large scale by contemporary "chatGPT jailbreaks", in which adversarial users trick the LLM into breaking its alignment guardrails by triggering it into acting as a malicious persona. Our results expose fundamental limitations in the alignment of LLMs and bring to the forefront the need to devise reliable mechanisms for ensuring AI safety.

A growing concern, due to the increasing reliance on LLMs for such purposes, is the harm they can cause their users, such as feeding them fake information (Lin et al., 2022; Weidinger et al., 2022), behaving offensively and reinforcing social biases (Hutchinson et al., 2020; Venkit et al., 2022; Weidinger et al., 2022), or encouraging problematic behaviors, even by psychologically manipulating users (Roose, 2023; Atillah, 2023). The act of removing these undesired behaviors is often called alignment (Yudkowsky, 2001; Taylor et al., 2016; Amodei et al., 2016; Shalev-Shwartz et al., 2020; Hendrycks et al., 2021; Pan et al., 2022; Ngo, 2022). There are several different approaches to performing alignment in LLMs.
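A toy numerical illustration of the intuition (our own assumption-laden sketch, not the paper's Behavior Expectation Bounds formalism): if the aligned model still places a small but finite weight on an ill-behaved component, then each prompt token that is more likely under that component shifts the posterior toward it, and the shift compounds with prompt length.

```python
# Toy posterior-reweighting illustration (all numbers are assumptions for illustration).
import numpy as np

prior_misaligned = 0.01          # alignment attenuated, but did not remove, the bad component
lr_good, lr_bad = 0.2, 0.6       # per-token likelihood of an adversarial token under each component

for prompt_len in [0, 5, 10, 20, 40]:
    odds = (prior_misaligned / (1 - prior_misaligned)) * (lr_bad / lr_good) ** prompt_len
    posterior = odds / (1 + odds)
    print(f"prompt length {prompt_len:>2}: P(misaligned component | prompt) = {posterior:.3f}")
```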
Prototypical Networks for Multi-Label Learning
Yang, Zhuo, Han, Yufei, Yu, Guoxian, Zhang, Xiangliang
We propose to address multi-label learning by jointly estimating the distribution of positive and negative instances for all labels. Through a shared mapping function, each label's positive and negative instances are mapped into a new space, forming a mixture distribution of two components (positive and negative). Due to the dependency among labels, positive instances are mapped close together if they share common labels, while positive and negative embeddings of the same label are pushed apart. The distribution is learned in the new space, and thus captures both the distances between instances in their original feature space and their shared membership across different categories. By evaluating the density functions, new instances mapped into the new space can easily be assigned their membership to possibly multiple categories. We use neural networks to learn the mapping function and use the expectations of the positive and negative embeddings as the prototypes of the positive and negative components for each label, respectively. We therefore name our proposed method PNML (prototypical networks for multi-label learning). Extensive experiments verify that PNML significantly outperforms state-of-the-art methods.
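A minimal sketch of the prototype idea (our own illustration with random data and an untrained embedding; the training objectives that pull same-label positives together are omitted): embed instances with a shared network, take per-label positive and negative prototypes as the means of the embedded positive and negative instances, and assign a new instance every label whose positive prototype is the closer of the two.

```python
# Illustrative prototype-based multi-label assignment (not the authors' PNML code).
import torch

torch.manual_seed(0)
n, d_in, d_emb, n_labels = 200, 20, 8, 3
X = torch.randn(n, d_in)
Y = (torch.rand(n, n_labels) > 0.5).float()      # multi-label targets (random illustrative data)

embed = torch.nn.Sequential(torch.nn.Linear(d_in, 16), torch.nn.ReLU(), torch.nn.Linear(16, d_emb))

def prototypes(X, Y):
    Z = embed(X)
    pos = torch.stack([Z[Y[:, l] == 1].mean(0) for l in range(n_labels)])  # positive prototype per label
    neg = torch.stack([Z[Y[:, l] == 0].mean(0) for l in range(n_labels)])  # negative prototype per label
    return pos, neg

def predict(x_new):
    z = embed(x_new)
    pos, neg = prototypes(X, Y)
    d_pos = torch.cdist(z, pos)                   # distance to each label's positive prototype
    d_neg = torch.cdist(z, neg)
    return (d_pos < d_neg).float()                # assign every label whose positive prototype is closer

print(predict(torch.randn(5, d_in)))
```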
From Raw Sensor Data to Detailed Spatial Knowledge
Zhang, Peng (Australian National University) | Lee, Jae Hee (Australian National University) | Renz, Jochen (Australian National University)
Qualitative spatial reasoning deals with relational spatial knowledge and with how this knowledge can be processed efficiently. Identifying suitable representations for spatial knowledge and checking whether the given knowledge is consistent have been the main research foci in the past two decades. However, where spatial information comes from, what kind of information can be obtained, and how it can be obtained have been largely ignored. This paper is an attempt to start filling this gap. We present a method for extracting detailed spatial information from sensor measurements of regions. We analyse how different sparse sensor measurements can be integrated and what spatial information can be extracted from them. Different from previous approaches to qualitative spatial reasoning, our method allows us to obtain detailed information about the internal structure of regions. The result has practical implications, for example in disaster management scenarios, which include identifying safe zones in bushfire and flood regions.
A Local Non-Negative Pursuit Method for Intrinsic Manifold Structure Preservation
Chen, Dongdong (Sichuan University) | Lv, Jian Cheng (Sichuan University) | Yi, Zhang (Sichuan University)
Local neighborhood selection plays a crucial role in most representation-based manifold learning algorithms. This paper reveals that an improper choice of neighborhood for representation learning introduces negative components into the learnt representations. Importantly, representations with negative components degrade the preservation of the intrinsic manifold structure. In this paper, a local non-negative pursuit (LNP) method is proposed for neighborhood selection, and non-negative representations are learnt. Moreover, it is proved that the learnt representations are sparse and convex. Theoretical analysis and experimental results show that the proposed method matches or outperforms the state-of-the-art on various manifold learning problems.
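A minimal sketch in the spirit of this idea (our own illustration, not the authors' exact LNP algorithm): for each point, take a set of candidate neighbours and learn non-negative reconstruction weights by non-negative least squares, then normalize them into a convex combination; the resulting representation has no negative components and is typically sparse.

```python
# Non-negative local reconstruction weights (illustrative, assumed neighborhood size).
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))        # toy data set
k = 10                                   # candidate neighborhood size (illustrative choice)

def local_nonnegative_weights(i):
    dists = np.linalg.norm(X - X[i], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]            # k nearest candidates, excluding the point itself
    w, _ = nnls(X[neighbors].T, X[i])                 # non-negative reconstruction weights
    w = w / w.sum() if w.sum() > 0 else w             # normalize to the simplex (convex combination)
    return neighbors, w

nbrs, w = local_nonnegative_weights(0)
print(nbrs[w > 1e-8], w[w > 1e-8])   # the selected neighborhood is typically sparse
```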